Introduction

About this handbook

Purpose

This “handbook” will serve as a reference guide specifically for epidemiologists using R, providing step-by-step examples of how to complete common epi tasks and outputs, with clear instructions and code.

The problem:

  • Most online R help resources are neither task-centered nor epidemiology-focused
  • Epidemiologists who are learning or new to R often must Google and skim dozens of forum pages to complete common data manipulation and visualization tasks
  • Field epidemiologists often work in low internet-connectivity environments and have limited technical support from HQ

How to read this handbook:

  • This handbook is an HTML file. It is not online; you are only using your web browser to view this local file.

  • This handbook is best viewed with Google Chrome. Some functions may not work in other browsers.

  • Use the tabs on the right to hide/view code. See the ‘Copy to clipboard’ icon in the upper-right of each code section.

Version
The latest version of this handbook can be found at this github repository.

Style

  • The handbook gives one recommended way of completing a task, and offers other methods/packages as appropriate.
  • The handbook generally uses tidyverse R coding style. Read more here (TODO).
  • Package names are written in bold (e.g. dplyr) and functions are like this: mutate().
  • Code in this handbook sometimes explicitly names the package of a function (e.g. dplyr::mutate()), so it is clear to the reader which package is being used.

Note types

FOR EXAMPLE: This is a boxed example

NOTE: This is a note

TIP: This is a tip.

CAUTION: This is a cautionary note.

DANGER: This is a warning.

Datasets used

Here the datasets used in this handbook will be described and will be “downloadable” via link (the files will be stored within the HTML, so available offline as well)

  • Linelist (…) Linelist for the 2013 (first wave) H7N9 outbreak in China (source)
  • Aggregated case counts (…)
  • GIS coordinates (…)
  • GIS shapefile (…)
  • modeling dataset? (…)

Contributors

Maintainer: Neale Batra ()

Code contributors:

Data contributors: outbreaks package

Content provided by these people…. a…b…c…d…

Review provided by these people…

Some of this material comes from the R4Epis website, which was also made by some of the same people…

RECON

Photo credits (logo): CDC Public Image gallery; R Graph Gallery

This is one page of the R Handbook for Epidemiologists, but is being printed as a stand-alone page.

You can find the complete handbook HERE.

R Basics

Overview

This section is not meant as a comprehensive “how to learn R” tutorial. However, it does cover some of the fundamentals that can be good to reference or refresh.

More comprehensive tutorials are available online:
* Here
* and Here
* and even Here
* Oh yea and Here too (there’s a lot of them)

Why use R?

  • Reproducibility
  • Fewer errors
  • Collaboration
  • Free

Installation

How to install R

How to install RStudio

Other things you may need to install:
* TinyTeX
* Pandoc
* RTools

RStudio

RStudio Orientation

First, open RStudio. As their icons can look very similar, be sure you are opening RStudio and not R.

For RStudio to function you must also have R installed on the computer (see this section for installation instructions).

RStudio is an interface (GUI) for easier use of R. You can think of R as being the engine of a vehicle, doing the crucial work, and RStudio as the body of the vehicle (with seats, accessories, etc.) that helps you actually use the engine to move forward!

By default, RStudio displays four rectangular panes.

TIP: If your RStudio displays only one left pane it is because you have no scripts open yet.

The R Console Pane

The R Console, by default the left or lower-left pane in RStudio, is the home of the R “engine”. This is where commands are actually run and where non-graphic outputs and error/warning messages appear. You can enter and run commands directly in the R Console, but be aware that these commands are not saved, as they are when you run them from a script.

If you are familiar with Stata, the R Console is like the Command Window and also the Results Window.

The Source Pane
This pane, by default in the upper-left, is the space to edit and run your scripts. This pane can also display datasets (data frames) for viewing.

For Stata users, this pane is similar to your Do-file and Data Editor windows.

The Environment Pane
This pane, by default the upper-right, is most often used to see brief summaries of objects in the R Environment in the current session. These objects could include imported, modified, or created datasets, parameters you have defined (e.g. a specific epi week for the analysis), or vectors or lists you have defined during analysis (e.g. names of regions). Click on the arrow next to a dataframe name to see its variables.

In Stata, this is most similar to the Variables Manager window.

Plots, Packages, and Help Pane
The lower-right pane includes several tabs including plots (display of graphics including maps), help, a file library, and available R packages (including installation/update options).

This pane contains the Stata equivalents of the Plots Manager and Project Manager windows.

RStudio settings

Change RStudio settings and appearance in the Tools drop-down menu, by selecting Global Options.

Scripts

Why use a script

R scripts (vs. typing in the console)
* Advantages (reproducibility)
* General sequence (intro, load packages, load data, clean data, conduct analysis, save results)
* Commenting

Rmarkdown

R notebooks

RShiny

Working directory

These tabs cover how to use R working directories, and how this changes when you are working within an R project. The working directory is the root file location used by R for your work.
By default, it will save new files and outputs to this location, and will look for files to import (e.g. datasets) here as well.

NOTE: If using an R project, the working directory will default to the R project root folder IF you open RStudio by opening the R project (the file with the .Rproj extension).

Set by Command

Use the command setwd() with the filepath in quotations, for example: setwd("C:/Documents/R Files")
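
As a minimal sketch, the commands below check the current working directory with getwd(), change it with setwd(), and then restore it. The temporary folder returned by tempdir() is only a stand-in so that the example runs on any computer; in practice you would supply your own folder path.

```r
# Check which folder R is currently using as the working directory
original_wd <- getwd()

# Change the working directory (tempdir() is a stand-in for your own folder path)
setwd(tempdir())
new_wd <- getwd()

# Restore the original working directory
setwd(original_wd)
```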

CAUTION: If using an R Markdown script be aware of the following:

In an R Markdown script, the default working directory is the folder the R Markdown file (.Rmd) is saved to. If you want to change this, you can use setwd() as above, but know the change will only apply to that specific code chunk.

To change the working directory for all code chunks in an R Markdown file, edit the setup chunk to add the root.dir = parameter, such as below:

knitr::opts_knit$set(root.dir = 'desired/filepath/here')

Set Manually

Setting your working directory manually (point-and-click)

From RStudio click: Session / Set Working Directory / Choose Directory (you will have to do this each time you open RStudio)

In an R project

How things change in an R project

Objects

Everything in R is an object. These sections will explain:

  • How to create objects (<-)
  • Types of objects (e.g. data frames, vectors..)
  • How to access subparts of objects (e.g. variables in a dataset)
  • Classes of objects (e.g. numeric, character, factor)

Everything is an object

Everything you store in R - datasets, variables, a list of village names, a total population number, even outputs such as graphs - is an object that is assigned a name and can be referenced in later commands.

An object exists when you have assigned it a value (see the assignment section below). When it is assigned a value, the object appears in the Environment (see the upper right pane of RStudio). It can then be operated upon, manipulated, changed, and re-defined.

Creating objects (<-)

Create objects by assigning them a value with the <- operator.
You can think of the assignment operator <- as the words “is defined as”. Assignment commands generally follow a standard order:

object_name <- value (or process/calculation that produces a value)

EXAMPLE: You may want to record the current epidemiological reporting week as an object for reference in later code. In this example, the object reporting_week is created when it is assigned the character value "2018-W10" (the quote marks make this a character value).
The object reporting_week will then appear in the RStudio Environment pane (upper-right) and can be referenced in later commands.

See the R commands and their output in the boxes below.

reporting_week <- "2018-W10"   # this command creates the object reporting_week by assigning it a value
reporting_week                 # this command prints the current value of reporting_week object in the console
## [1] "2018-W10"

NOTE: The [1] in the R console output simply indicates that you are viewing the first item of the output.

CAUTION: An object’s value can be over-written at any time by running an assignment command to re-define its value. Thus, the order of the commands run is very important.

The following command will re-define the value of reporting_week:

reporting_week <- "2018-W51"   # assigns a NEW value to the object reporting_week
reporting_week                 # prints the current value of reporting_week in the console
## [1] "2018-W51"

Datasets are also objects and must be assigned names when they are imported.

In the code below, the object linelist is created and assigned the value of a CSV file imported with the rio package.

# linelist is created and assigned the value of the imported CSV file
linelist <- rio::import("my_linelist.csv")

You can read more about importing and exporting datasets with the section on importing data.

CAUTION: A quick note on naming of objects:

  • Object names must not contain spaces; use an underscore (_) or a period (.) instead of a space.
  • Object names are case-sensitive (meaning that Dataset_A is different from dataset_A).
  • Object names must begin with a letter (cannot begin with a number like 1, 2 or 3).

Object Structure

Objects can be a single piece of data (e.g. my_number <- 24), or they can consist of structured data.

The graphic below, sourced from this online R tutorial shows some common data structures and their names. Not included in this image is spatial data, which is discussed in the GIS section.

In epidemiology (and particularly field epidemiology), you will most commonly encounter data frames and vectors:

| Common structure | Explanation | Example from templates |
|---|---|---|
| Vectors | A container for a sequence of singular objects, all of the same class (e.g. numeric, character). | “Variables” (columns) in data frames are vectors (e.g. the variable age_years). |
| Data Frames | Vectors (e.g. columns) that are bound together that all have the same number of rows. | linelist_raw and linelist_cleaned are both data frames. |

Note that to create a vector that “stands alone”, or is not part of a data frame (such as a list of location names), the function c() is often used:
list_of_names <- c("Ruhengeri", "Gisenyi", "Kigali", "Butare")
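
To make the relationship between the two structures concrete, here is a small sketch that builds two stand-alone vectors and binds them into a data frame. The object names and the case counts are hypothetical and exist only for illustration.

```r
# Two stand-alone vectors of the same length (hypothetical example values)
district   <- c("Ruhengeri", "Gisenyi", "Kigali", "Butare")   # character vector
case_count <- c(12, 7, 30, 4)                                 # numeric vector

# Bind the vectors together as columns of a data frame
district_cases <- data.frame(district, case_count)

nrow(district_cases)   # number of rows (observations)
ncol(district_cases)   # number of columns (variables)
```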

Variables ($)

Vectors within a data frame (variables in a dataset) can be called, referenced, or created using the $ symbol. The $ symbol connects the name of the column to the name of its data frame. The $ symbol must be used, otherwise R will not know where to look for or create the column.

# Retrieve the length of the vector age
length(linelist$age) # (age is a variable in the linelist data frame)

By typing the name of the data frame followed by $ you will also see a list of all variables in the data frame. You can scroll through them using your arrow key, select one with your Enter key, and avoid spelling mistakes!
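
The sketch below shows both uses of $ on a tiny, hypothetical data frame (example_linelist and its values are made up for this illustration): retrieving an existing column, and creating a new one.

```r
# A small hypothetical data frame standing in for a linelist
example_linelist <- data.frame(
  case_id = c("A1", "A2", "A3"),
  age     = c(34, 5, 61)
)

# Retrieve an existing column as a vector
example_linelist$age

# Create a new column by assigning to a column name that does not exist yet
example_linelist$age_months <- example_linelist$age * 12
```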

Object Classes

All the objects stored in R have a class which tells R how to handle the object. There are many possible classes, but common ones include:

| Class | Explanation | Examples |
|---|---|---|
| Character | These are text/words/sentences “within quotation marks”. Math cannot be done on these objects. | “Character objects are in quotation marks” |
| Numeric | These are numbers and can include decimals. If within quotation marks they will be considered character. | 23.1 or 14 |
| Integer | Numbers that are whole only (no decimals) | -5, 14, or 2000 |
| Factor | These are vectors that have a specified order or hierarchy of values | Variable msf_involvement with ordered values N, S, SUB, and U. |
| Date | Once R is told that certain data are Dates, these data can be manipulated and displayed in special ways. See the page on Dates for more information. | 2018-04-12 or 15/3/1954 or Wed 4 Jan 1980 |
| Logical | Values must be one of the two special values TRUE or FALSE (note these are not “TRUE” and “FALSE” in quotation marks) | TRUE or FALSE |
| data.frame | A data frame is how R stores a typical dataset. It consists of vectors (columns) of data bound together, that all have the same number of observations (rows). | The example AJS dataset named linelist_raw contains 68 variables with 300 observations (rows) each. |

You can test the class of an object by feeding it to the function class(). Note: you can reference a specific column within a dataset using the $ notation to separate the name of the dataset and the name of the column.

class(linelist$age)     # class should be numeric
## [1] "numeric"

class(linelist$gender)  # class should be character
## [1] "character"

Often, you will need to convert objects or variables to another class.

| Function | Action |
|---|---|
| as.character() | Converts to character class |
| as.numeric() | Converts to numeric class |
| as.integer() | Converts to integer class |
| as.Date() | Converts to Date class - Note: see section on dates for details |
| as.factor() | Converts to factor - Note: re-defining order of value levels requires extra arguments |
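
As a brief illustration of class conversion, the sketch below converts ages that were stored as text into numbers. The object names and values are hypothetical.

```r
age_text <- c("23", "41", "7")     # numbers stored as character (note the quotation marks)
class(age_text)                    # "character"

age_num <- as.numeric(age_text)    # convert to numeric so that math can be done
class(age_num)                     # "numeric"
mean(age_num)

# Values that cannot be converted become NA (R also gives a warning, suppressed here)
suppressWarnings(as.numeric("unknown"))
```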

Here is more online material on classes and data structures in R.

Functions

This section on functions explains:
* What a function is and how it works
* What arguments are
* What packages are
* How to get help understanding a function

Simple Functions

A function is like a machine that receives inputs, does some action with those inputs, and produces an output.
What the output is depends on the function.

Functions typically operate upon some object placed within the function’s parentheses. For example, the function sqrt() calculates the square root of a number:

sqrt(49)
## [1] 7

Functions can also be applied to variables in a dataset. For example, when the function summary() is applied to the numeric variable age in the dataset linelist (what’s the $ symbol?), the output is a summary of the variable’s numeric and missing values.

summary(linelist$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    2.00   47.25   60.00   56.91   72.00   91.00       2

NOTE: Behind the scenes, a function represents complex additional code that has been wrapped up for the user into one easy command.

Functions with Multiple Arguments

Functions often ask for several inputs, called arguments, located within the parentheses of the function, usually separated by commas.

  • Some arguments are required for the function to work correctly, others are optional.
  • Optional arguments have default settings if they are not specified.
  • Arguments can take character, numeric, logical (TRUE/FALSE), and other inputs.

For example, this age_pyramid() command produces an age pyramid graphic based on defined age groups and a binary split variable, such as gender. The function is given three arguments within the parentheses, separated by commas. The values supplied to the arguments establish linelist as the data frame to use, age_group as the variable to count, and gender as the binary variable to use for splitting the pyramid by color.

NOTE: For this example, in the background we have created a new variable called “age_group”. To learn how to create a new variable, see that section of this handbook.

# Creates an age pyramid by specifying the dataframe, age group variable, and a variable to split the pyramid
apyramid::age_pyramid(data = linelist, age_group = "age_group", split_by = "gender")

The first half of an argument assignment (e.g. data =) does not need to be specified if the arguments are written in a specific order (specified in the function’s documentation). The below code produces the exact same pyramid as above, because the function expects the argument order: data frame, age_group variable, split_by variable.

# This command will produce the exact same graphic as above
apyramid::age_pyramid(linelist, "age_group", "gender")

A more complex age_pyramid() command might include the optional arguments to:

  • Show proportions instead of counts (set proportional = TRUE when the default is FALSE)
  • Specify the two colors to use (pal = is short for “palette” and is supplied with a vector of two color names. See the objects page for how the function c() makes a vector)

NOTE: For arguments specified with an equals symbol (e.g. proportional = ...), their order among the arguments is not important (they must still be within the parentheses and separated by commas).

apyramid::age_pyramid(linelist, "age_group", "gender", proportional = TRUE, pal = c("orange", "purple"))
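
The same rules about named arguments, argument order, and defaults can be tried without any dataset by using the base R function round(), whose arguments are x (the number) and digits (default 0). This is only a sketch to illustrate the idea with a simpler function.

```r
round(x = 3.14159, digits = 2)   # arguments named explicitly
round(3.14159, 2)                # names omitted: values are matched by position
round(digits = 2, x = 3.14159)   # named arguments can be given in any order
round(3.14159)                   # digits is optional and defaults to 0
```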

Packages

Packages contain functions.

On installation, R contains “base” functions that perform common elementary tasks. But many R users create specialized functions, which are verified by the R community and which you can download as a package for your own use.

One of the more challenging aspects of R is that there are often many functions or packages to choose from to complete a given task.

Functions are contained within packages which can be downloaded (“installed”) to your computer from the internet. Once a package is downloaded, you access its functions by loading the package with the library() command at the beginning of each R session.

# this loads the package "tidyverse" for use in the current R session
library(tidyverse)

NOTE: While you only have to install a package once, you must load it at the beginning of every R session using the library() command, or an alternative like pacman’s p_load() function.

Think of R as your personal library: When you download a package your library gains a book of functions, but each time you want to use a function in that book, you must borrow that book from your library.

For clarity in this handbook, functions are usually preceded by the name of their package using the :: symbol in the following way:

package_name::function_name()

Once a package is loaded for a session, this explicit style is not necessary; you can just use function_name(). However, giving the package name is useful when a function name is common and may exist in multiple packages (e.g. plot()).
Using the package name also lets you call the function without first loading the package with library(), provided the package is installed.

# This command uses the package "rio" and its function "import()" to import a dataset
linelist <- rio::import("linelist.xlsx", which = "Sheet1")

Dependencies
Packages often depend on other packages, and these are called “dependencies”. When a package is installed from CRAN, it will typically also install its dependencies.

Function Help

To read more about a function, you can try searching online for resources OR search in the Help tab of the lower-right RStudio pane.
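
Help can also be requested from the R Console. The commands below are a short sketch using base R functions; summary() and sd() are used only as examples.

```r
?summary          # opens the help page for summary() (in RStudio, in the Help tab)
help("summary")   # an equivalent way to open the same help page
args(sd)          # prints the arguments accepted by sd() and their default values
```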

Piping

Piping (%>%)

Two general approaches to R coding are:

  1. Tidyverse - piping an object from function to function
  2. Defining intermediate objects (an older method, still worth knowing about)

Piping

Simply explained, the pipe operator (%>%) passes an intermediate output from one function to the next.
You can think of it as saying “then”. Many functions can be linked together with %>%.

  • Piping emphasizes a sequence of actions, not the object the actions are being performed on.

  • It works best when a sequence of actions must be performed on one object: the object is altered and then passed on to the next function.

  • The pipe operator comes from the magrittr package and is included in dplyr and the tidyverse.

  • It makes code cleaner, more intuitive, and easier to read.

Example:

# A fake example of how to bake a cake using piping syntax

cake <- flour %>%       # to define cake, start with flour, and then...
  left_join(eggs) %>%   # add eggs
  left_join(oil) %>%    # add oil
  left_join(water) %>%  # add water
  mix_together(utensil = spoon, minutes = 2) %>%                # mix together
  bake(degrees = 350, system = "fahrenheit", minutes = 35) %>%  # bake
  let_cool()            # let it cool down

https://cfss.uchicago.edu/notes/pipes/#:~:text=Pipes%20are%20an%20extremely%20useful,code%20and%20combine%20multiple%20operations.

The pipe operator %>% is not part of base R. To use it, the dplyr package (or another package that provides it, such as magrittr) must be installed and loaded. Near the top of every template script is a code chunk that installs and loads the necessary packages, including dplyr. You can read more about piping in the documentation.

CAUTION: Remember that even when using piping to link functions, if the assignment operator (<-) is present, the object to the left will still be over-written (re-defined) by the right side.
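
Unlike the cake example, the following sketch can actually be run. It assumes the magrittr package is installed, and the numbers are arbitrary. It pipes a vector through two base R functions and compares the result with the nested equivalent.

```r
library(magrittr)   # provides %>% (dplyr and tidyverse also make it available)

case_counts <- c(4, 9, 16)

# Piped version: take case_counts, THEN take the square roots, THEN sum them
piped_total <- case_counts %>%
  sqrt() %>%
  sum()

# The equivalent nested version, read from the inside out
nested_total <- sum(sqrt(case_counts))
```

Note that case_counts itself is unchanged; only piped_total, on the left of the assignment operator, is defined by the piped commands.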

TODO %<>% shortcut for re-defining the object and piping

Intermediate objects

This older method is still handy to know. It is better if:
* You need to manipulate multiple objects
* There are intermediate steps that are meaningful and deserve separate object names

Risks: creating a new object for each step produces many objects. If you use the wrong one you might not notice, naming can become confusing, and errors are not easily detected.

You can either name each intermediate object, overwrite the original, or combine all the functions together. All of these options come with risks.

https://style.tidyverse.org/pipes.html

# a fake example of how to bake a cake using this method (defining intermediate objects)
batter_1 <- left_join(flour, eggs)
batter_2 <- left_join(batter_1, oil)
batter_3 <- left_join(batter_2, water)

batter_4 <- mix_together(object = batter_3, utensil = spoon, minutes = 2)

cake <- bake(batter_4, degrees = 350, system = "fahrenheit", minutes = 35)

cake <- let_cool(cake)

Combine all functions together - also difficult to read

# an example of combining/nesting multiple functions together - difficult to read
cake <- let_cool(bake(mix_together(batter_3, utensil = spoon, minutes = 2), degrees = 350, system = "fahrenheit", minutes = 35))

Operators

This section details operators in R, such as:
* Relational operators (less than, equal to…)
* Logical operators (and, or…)
* Missingness
* Mathematical operators (+, -, /…)
* The %in% operator

Relational and logical operators

Relational operators compare values and are often used when defining new variables and subsets of datasets. Here are the common relational operators in R:

| Function | Operator | Example | Example Result |
|---|---|---|---|
| Equal to | == | "A" == "a" | FALSE (because R is case sensitive). Note that == (double equals) is different from = (single equals), which acts like the assignment operator <- |
| Not equal to | != | 2 != 0 | TRUE |
| Greater than | > | 4 > 2 | TRUE |
| Less than | < | 4 < 2 | FALSE |
| Greater than or equal to | >= | 6 >= 4 | TRUE |
| Less than or equal to | <= | 6 <= 4 | FALSE |
| Value is missing | is.na() | is.na(7) | FALSE (see section on missing values) |
| Value is not missing | !is.na() | !is.na(7) | TRUE |

Logical operators, such as AND and OR, are often used to connect relational operators and create more complicated criteria. Complex statements might require parentheses ( ) for grouping and order of application.

Function Operator
AND &
OR | (vertical bar)
Parentheses ( ) Used to group criteria together and clarify order
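
The operators above can be tried directly in the console. The sketch below uses hypothetical values for age and gender, and shows how parentheses change the result when AND and OR are combined.

```r
age    <- 34
gender <- "f"

age >= 18                    # TRUE
gender == "F"                # FALSE, because R is case-sensitive
age >= 18 & gender == "f"    # TRUE: both criteria are met
age < 18 | gender == "f"     # TRUE: at least one criterion is met

# Parentheses control how criteria are grouped
child_age <- 5
(child_age < 10 | child_age > 100) & child_age > 50   # FALSE
child_age < 10 | (child_age > 100 & child_age > 50)   # TRUE
```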

For example, below we have a linelist with two variables we want to use to create our case definition: hep_e_rdt, a test result, and other_cases_in_hh, which tells us whether there are other cases in the household. The command below uses the function case_when() to create the new variable case_def such that:

linelist_cleaned <- linelist_cleaned %>%
  mutate(case_def = case_when(
    is.na(hep_e_rdt) & is.na(other_cases_in_hh)           ~ NA_character_,
    hep_e_rdt == "Positive"                               ~ "Confirmed",
    hep_e_rdt != "Positive" & other_cases_in_hh == "Yes"  ~ "Probable",
    TRUE                                                  ~ "Suspected"
  ))

| Criteria in example above | Resulting value in new variable “case_def” |
|---|---|
| If the values for both hep_e_rdt and other_cases_in_hh are missing | NA (missing) |
| If the value in hep_e_rdt is “Positive” | “Confirmed” |
| If the value in hep_e_rdt is NOT “Positive” AND the value in other_cases_in_hh is “Yes” | “Probable” |
| If none of the above criteria are met | “Suspected” |

TIP: Note that R is case-sensitive, so “Positive” is different from “positive”.

Missing Values

In R, missing values are represented by the special value NA (capital letters N and A - not in quotation marks). If you import data that records missing data in another way (e.g. 99, “Missing”, or .), you may want to re-code those values to NA.

To test whether a value is NA, use the special function is.na(), which returns TRUE or FALSE.

rdt_result <- c("Positive", "Suspected", "Positive", NA)   # two positive cases, one suspected, and one unknown
is.na(rdt_result)  # Tests whether the value of rdt_result is NA
## [1] FALSE FALSE FALSE  TRUE
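
Re-coding another missing-value code to NA can be sketched in base R as follows. The ages and the use of 99 as the “missing” code are hypothetical.

```r
# Hypothetical ages in which 99 was used to record "missing"
age <- c(25, 99, 41, 99, 3)

# Re-code the value 99 to NA
age[age == 99] <- NA

is.na(age)                # which values are missing
sum(is.na(age))           # how many values are missing
mean(age, na.rm = TRUE)   # na.rm = TRUE tells mean() to ignore NA values
```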

Mathematical operators

Mathematical operators are often used to perform addition, division, to create new columns, etc. Below are common mathematical operators in R. Whether you put spaces around the operators is not important.

| Objective | Example in R |
|---|---|
| addition | 2 + 3 |
| subtraction | 2 - 3 |
| multiplication | 2 * 3 |
| division | 30 / 5 |
| exponent | 2^3 |
| order of operations | ( ) |
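
A short sketch of these operators in use is shown below. The case and population numbers are hypothetical; the last command shows how operators act on every element of a vector, which is how new columns are usually calculated.

```r
2 + 3 * 4      # multiplication is done before addition
(2 + 3) * 4    # parentheses change the order of operations

# Operators work element-by-element on vectors
cases      <- c(10, 25, 5)
population <- c(1000, 5000, 250)
attack_rate_pct <- cases / population * 100
attack_rate_pct
```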

%in%

Loading Packages

This section describes the several ways to install a package:
* Via the online package repository (CRAN)
* From a ZIP file
* From Github

CRAN

ZIP files

Github

Errors & Warnings

This section explains:
* General syntax for writing R code
* Code assists
* the difference between errors and warnings

Common errors and warnings and their solutions can be found in X section (TODO).

General Syntax

A few things to remember when writing commands in R, to avoid errors and warnings:

  • Always close parentheses - tip: count the number of opening “(” and closing parentheses “)” for each code chunk
  • Avoid spaces in column and object names. Use underscore ( _ ) or periods ( . ) instead
  • Keep track of and remember to separate a function’s arguments with commas
  • R is case-sensitive, meaning Variable_A is different from variable_A

Code assists

Any script (R Markdown or otherwise) will give clues when you have made a mistake. For example, if you forgot to write a comma where it is needed, or to close a parenthesis, RStudio will raise a flag on that line, on the right side of the script, to warn you.

Errors and Warnings

When a command is run, the R Console may show you warning or error messages in red text.

  • A warning means that R has completed your command, but had to take additional steps or produced unusual output that you should be aware of.

  • An error means that R was not able to complete your command.
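
The difference can be seen in a small base R sketch. The use of try() is only so that the example does not stop a script part-way; it is not something you need when working interactively.

```r
# A WARNING: the command completes, but the result may not be what you expect
converted <- as.numeric(c("12", "unknown"))   # "unknown" becomes NA, with a warning
converted

# An ERROR: the command cannot be completed
# try() catches the error so that the rest of the script keeps running
result <- try("12" + 1, silent = TRUE)
class(result)   # "try-error"
```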

Look for clues:

  • The error/warning message will often include a line number for the problem.

  • If an object “is unknown” or “not found”, perhaps you spelled it incorrectly, forgot to call a package with library(), or forgot to re-run your script after making changes.

If all else fails, copy the error message into Google along with some key terms - chances are that someone else has worked through this already!

Recommended training

Importing data

Overview

Introduction to importing data

Packages

The key package we recommend for importing data is: rio. rio offers the useful function import() which can import many types of files into R.

The alternative to using rio would be to use functions from several other packages that are specific to a type of file (e.g. read.csv(), read.xlsx(), etc.). While these alternatives can be difficult to remember, always using rio::import() is relatively easy.

Optionally, the package here can be used in conjunction with rio. It locates files on your computer via relative pathways, usually within the context of an R project. Relative pathways are relative from a designated folder location, so that pathways listed in R code will not break when the script is run on a different computer.

This code chunk shows the loading of packages for importing data.

# Checks if package is installed, installs if necessary, and loads package for current session
pacman::p_load(rio, here)

import()

When you import a dataset, you are doing the following:

  1. Creating a new, named data frame object in your R environment
  2. Defining the new object as the imported dataset

The function import() from the package rio makes it easy to import many types of data files.

# An example:
#############
library(rio)                                                     # ensure package rio is loaded for use

# New object is defined as the imported data
my_csv_data <- import("linelist.csv")                            # importing a csv file

my_Excel_data <- import("observations.xlsx", which = "February") # import an Excel file

import() uses the file’s extension (e.g. .xlsx, .csv, .dta, etc.) to appropriately import the file. Any optional arguments specific to the filetype can be supplied as well.

You can read more about the rio package in this online vignette

CAUTION: In the example above, the datasets are assumed to be located in the working directory, or the same folder as the script.

Import from filepath

A filepath can be provided in full (as below) or as a relative filepath (see next tab). Providing a full filepath can be fast and may be the best option if referencing files from a shared/network drive.

The function import() (from the package rio) accepts a filepath in quotes. A few things to note:

  • Slashes must be forward slashes, as in the code shown. This is NOT the default for Windows filepaths.
  • Filepaths that begin with double slashes (e.g. “//…”) will likely not be recognized by R and will produce an error. Consider moving these files to a “named” or “lettered” drive that begins with a letter (e.g. “J:” or “C:”). See the section on using Network Drive for more details on this issue.

# A demonstration showing how to import a file using its full filepath
my_data <- rio::import("C:/Users/Neale/Documents/my_excel_file.xlsx")
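
If you are unsure about slashes, base R can build or convert the path for you. The sketch below reuses the example path from above; nothing is imported, so the file does not need to exist.

```r
# file.path() joins folder and file names with forward slashes
built_path <- file.path("C:", "Users", "Neale", "Documents", "my_excel_file.xlsx")
built_path

# A Windows path with backslashes must have each backslash doubled inside quotes in R...
windows_path <- "C:\\Users\\Neale\\Documents\\my_excel_file.xlsx"

# ...and can be converted to forward slashes
forward_path <- gsub("\\", "/", windows_path, fixed = TRUE)
forward_path
```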

Import Excel sheet

If importing a specific sheet from an Excel file, include the sheet name in the which = argument of import(). For example:

# A demonstration showing how to import a specific Excel sheet
my_data <- rio::import("my_excel_file.xlsx", which = "Sheetname")

If using the here() method to provide a relative pathway to import(), you can still indicate a specific sheet by adding the which = argument after the closing parenthesis of the here() function.

# Demonstration: importing a specific Excel sheet when using relative pathways with the 'here' package
linelist_raw <- import(here("data", "linelists", "linelist.xlsx"), which = "Sheet1")

Select file manually

You can import data manually via one of these methods:

  • Environment RStudio Pane, click “Import Dataset”, and select the type of data
  • Click File / Import Dataset / (select the type of data)
  • To hard-code manual selection, use the base R command file.choose() (leaving the parentheses empty) to trigger appearance of a pop-up window that allows the user to manually select the file from their computer. For example:
# A demonstration showing manual selection of a file. When this command is run, a POP-UP window should appear. 
# The filepath of the selected file will be supplied to the import() command.

my_data <- rio::import(file.choose())

TIP: The pop-up window may appear BEHIND your RStudio window.

Relative filepaths (here())

Relative filepaths differ from static filepaths in that they are relative to an R project root directory. For example:

  • A static filepath: import("C:/Users/nsbatra/My Documents/R files/epiproject/data/linelists/ebola_linelist.xlsx")
    • Specific fixed path
    • Useful if multiple users are running a script hosted on a network drive
  • A relative filepath: import(here("data", "linelists", "ebola_linelist.xlsx"))
    • Path is given in relation to a root directory (typically the root folder of an R project)
    • Best if working within an R project, or planning to zip and share entire project with others

The package here and its function here() facilitate relative pathways.

here() works best within R projects. When the here package is first loaded (library(here)), it automatically considers the top-level folder of your R project as “here” - a benchmark for all other files in the project.

Thus, in your script, if you want to import or reference a file saved in your R project’s folders, you use the function here() to tell R where the file is in relation to that benchmark.

If you are unsure where “here” is set to, run the function here() with the empty brackets:

# This command tells you the folder path that "here" is set to 
here::here()

Below is an example of importing the file “fluH7N9_China_2013.csv” which is located in the benchmark “here” folder. All you have to do is provide the name of the file in quotes (with the appropriate ending).

linelist <- import(here("fluH7N9_China_2013.csv"))

If the file is within a subfolder - let’s say a “data” folder - write these folder names in quotes, separated by commas, as below:

linelist <- import(here("data", "fluH7N9_China_2013.csv"))

Using the here() command produces a character filepath, which can then be processed by the import() function.

# the filepath
here("data", "fluH7N9_China_2013.csv")

# the filepath is given to the import() function
linelist <- import(here("data", "fluH7N9_China_2013.csv"))

NOTE: You can still import a specific sheet of an Excel file as noted in the Excel tab. The here() command only supplies the filepath.
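Because here() only returns a character filepath, the same idea can be illustrated with the base R function file.path(). This is a rough stand-in rather than the here package itself: the temporary folder, file name, and object names below are hypothetical and exist only so the sketch can run anywhere.

```r
# Create a throwaway "project" folder containing a "data" subfolder (illustration only)
project_root <- file.path(tempdir(), "epiproject_demo")
dir.create(file.path(project_root, "data"), recursive = TRUE, showWarnings = FALSE)

# Write a tiny CSV so there is something to import
write.csv(data.frame(case_id = 1:3, age = c(34, 51, 8)),
          file.path(project_root, "data", "demo_linelist.csv"),
          row.names = FALSE)

# Build the path from folder and file names, much as here("data", "demo_linelist.csv") would
path_to_data <- file.path(project_root, "data", "demo_linelist.csv")

# Confirm the file exists before importing it
stopifnot(file.exists(path_to_data))
demo_linelist <- read.csv(path_to_data)
```

Checking file.exists() on the path first tends to give a clearer message than an import error when a folder name is misspelled.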

Manual data entry

Entry by columns

Since a data frame is a combination of vertical vectors (columns), R by default expects manual entry of data to also be in vertical vectors (columns).

# define each vector (vertical column) separately, each with its own name
PatientID <- c(235, 452, 778, 111)
Treatment <- c("Yes", "No", "Yes", "Yes")
Death     <- c(1, 0, 1, 0)

CAUTION: All vectors must be the same length (same number of values).

The vectors can then be bound together using the function data.frame():

# combine the columns into a data frame, by referencing the vector names
manual_entry_cols <- data.frame(PatientID, Treatment, Death)

And now we display the new dataset:

# display the new dataset
DT::datatable(manual_entry_cols)
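The length requirement can be checked before the vectors are combined. The sketch below re-creates the vectors above and uses only base R; the names vector_lengths and same_length are illustrative.

```r
PatientID <- c(235, 452, 778, 111)
Treatment <- c("Yes", "No", "Yes", "Yes")
Death     <- c(1, 0, 1, 0)

# lengths() returns the number of values in each element of the list
vector_lengths <- lengths(list(PatientID = PatientID,
                               Treatment = Treatment,
                               Death     = Death))

# TRUE only if every vector has the same number of values
same_length <- length(unique(vector_lengths)) == 1

if (same_length) {
  manual_entry_cols <- data.frame(PatientID, Treatment, Death)
  str(manual_entry_cols)   # shows the class of each column
} else {
  stop("Vectors differ in length: ", paste(vector_lengths, collapse = ", "))
}
```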

Entry by rows

Use the tribble() function from the tibble package from the tidyverse (online tibble reference).

Note how column headers start with a tilde (~). Also note that each column must contain only one class of data (character, numeric, etc.).
You can use tabs, spacing, and new rows to make the data entry more intuitive and readable. For example:

# create the dataset manually by row
manual_entry_rows <- tibble::tribble(
                        ~colA, ~colB,
                        "a",   1,
                        "b",   2,
                        "c",   3
                      )

And now we display the new dataset:

# display the new dataset
DT::datatable(manual_entry_rows)

Pasting from clipboard

If you copy data from elsewhere and have it on your clipboard, you can try the following command to convert those data into an R data frame:

manual_entry_clipboard <- read.table(file = "clipboard",
                                     sep = "\t",          # separator could be tab, or commas, etc.
                                     header=TRUE)         # if there is a header row
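Note that file = "clipboard" is understood on Windows; other operating systems may need a different approach. To see how the sep = and header = arguments behave without relying on the clipboard, the same function can read from a character string via its text = argument. The values below are made up for illustration.

```r
# Tab-separated text, as it might look after copying cells from a spreadsheet
copied_text <- "case_id\tage\toutcome\n1\t34\tRecover\n2\t51\tDeath"

from_text <- read.table(text = copied_text,
                        sep = "\t",                # values are separated by tabs
                        header = TRUE,             # first row holds the column names
                        stringsAsFactors = FALSE)  # keep text as character

from_text
```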

Working with Dates

Overview

  • It is important to make R recognize when a variable contains dates.
  • Dates are an object class and can be tricky to work with.
  • Here we present several ways to convert date variables to Date class.

Packages

The following packages are recommended for working with dates:

# Checks if package is installed, installs if necessary, and loads package for current session

pacman::p_load(aweek,      # flexibly converts dates to weeks, and vice versa
               lubridate,  # for conversions to months, years, etc.
               linelist,   # function to guess messy dates
               ISOweek)    # another option for creating weeks

as.Date()

The standard, base R function to convert an object or variable to class Date is as.Date() (note capitalization).

as.Date() requires that the user specify the existing format of the date, so it can understand, convert, and store each element (day, month, year, etc.) correctly. Read more online about as.Date().

If used on a variable, as.Date() therefore requires that all the character date values be in the same format before converting. If your data are messy, try cleaning them or consider using guess_dates() from the linelist package.

It can be easiest to first convert the variable to character class, and then convert to date class:

  1. Turn the variable into character values using the function as.character()
linelist_cleaned$date_of_onset <- as.character(linelist_cleaned$date_of_onset)
  2. Convert the variable from character values into date values, using the function as.Date() (note the capital “D”)
  • Within the as.Date() function, you must use the format= argument to tell R the current format of the date components - which characters refer to the month, the day, and the year, and how they are separated. If your values are already in one of R’s standard date formats (YYYY-MM-DD or YYYY/MM/DD) the format= argument is not necessary.

    • The codes are:
      %d = Day # (of the month e.g. 16, 17, 18…)
      %a = abbreviated weekday (Mon, Tue, Wed, etc.)
      %A = full weekday (Monday, Tuesday, etc.)
      %m = # of month (e.g. 01, 02, 03, 04)
      %b = abbreviated month (Jan, Feb, etc.)
      %B = Full Month (January, February, etc.)
      %y = 2-digit year (e.g. 89)
      %Y = 4-digit year (e.g. 1989)

For example, if your character dates are in the format DD/MM/YYYY, like “24/04/1968”, then your command to turn the values into dates will be as below. Putting the format in quotation marks is necessary.

linelist_cleaned$date_of_onset <- as.Date(linelist_cleaned$date_of_onset, format = "%d/%m/%Y")

TIP: The format= argument is not telling R the format you want the dates to be, but rather how to identify the date parts as they are before you run the command.

TIP: Be sure that in the format= argument you use the date-part separator (e.g. /, -, or space) that is present in your dates.

The as.character() and as.Date() commands can optionally be combined as:

linelist_cleaned$date_of_onset <- as.Date(as.character(linelist_cleaned$date_of_onset), format = "%d/%m/%Y")

If using piping and the tidyverse, the above command might look like this:

linelist_cleaned <- linelist_cleaned %>%
  mutate(date_of_onset = as.character(date_of_onset),
         date_of_onset = as.Date(date_of_onset, format = "%d/%m/%Y"))

Once complete, you can run a command to verify the class of the variable:

# Check the class of the variable
class(linelist_cleaned$date_of_onset)  

Once the values are in class Date, R will by default display them in the standard format, which is YYYY-MM-DD.
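A few stand-alone conversions may help show how the format= argument must match the values as they are written. This is a small sketch on made-up single values rather than on the linelist.

```r
# Day/month/year separated by slashes
d1 <- as.Date("24/04/1968", format = "%d/%m/%Y")

# Month-day-year separated by dashes
d2 <- as.Date("04-24-1968", format = "%m-%d-%Y")

# A format that does not match the values returns NA rather than an error
d3 <- as.Date("24/04/1968", format = "%m/%d/%Y")   # there is no month 24

print(d1)      # displayed in the standard YYYY-MM-DD format
print(d2)
is.na(d3)      # TRUE
```

An unexpected column of NA values after conversion is therefore often a sign that the format= argument does not describe the data.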

lubridate

This is a section on using lubridate (Henry)

guess_dates()

The function guess_dates() attempts to read a “messy” date variable containing dates in many different formats and convert the dates to a standard format. You can read more online about guess_dates(), which is in the linelist package.

For example: guess_dates() would read the dates “03 Jan 2018”, “07/03/1982”, and “08/20/85” and convert them to class Date as 2018-01-03, 1982-03-07, and 1985-08-20.

linelist::guess_dates(c("03 Jan 2018", "07/03/1982", "08/20/85")) # guess_dates() not yet available on CRAN for R 4.0.2
                                                                  # try install via devtools::install_github("reconhub/linelist")

Some optional arguments for guess_dates() that you might include are:

  • error_tolerance - The proportion of entries which cannot be identified as dates to be tolerated (defaults to 0.1 or 10%)
  • last_date - the last valid date (defaults to current date)
  • first_date - the first valid date. Defaults to fifty years before the last_date.
# An example using guess_dates on the variable dtdeath
data_cleaned <- data %>% 
  mutate(dtdeath = linelist::guess_dates(dtdeath, error_tolerance = 0.1, first_date = "2016-01-01"))

Excel Dates

Excel stores dates as the number of days since December 30, 1899. If the dataset you imported from Excel shows dates as numbers or characters like “41369”, use the as.Date() function to convert, but instead of supplying a format as above, supply an origin date.

NOTE: You should provide the origin date in R’s default date format ("YYYY-MM-DD").

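# An example of providing the Excel 'origin date' when converting Excel number dates
# as.numeric() ensures the conversion also works if the numbers were imported as characters
data_cleaned <- data %>% 
  mutate(date_of_onset = as.Date(as.numeric(date_of_onset), origin = "1899-12-30"))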
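The conversion can be tried on single values to confirm the origin is behaving as expected. A minimal base R sketch, using the example number from the text; the object names are illustrative.

```r
# Excel serial numbers, stored as numbers
excel_serial <- c(41369, 41370)
converted <- as.Date(excel_serial, origin = "1899-12-30")
print(converted)   # 2013-04-05 and 2013-04-06

# The same number stored as text must be made numeric first
converted_from_text <- as.Date(as.numeric("41369"), origin = "1899-12-30")
print(converted_from_text)
```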

How dates are displayed

Once dates are the correct class, you often want them to display differently (e.g. in a plot, graph, or table). For example, to display as “Friday 05 Jan” instead of 2018-01-05. You can do this with the function format(), which works in a similar way to as.Date(). Read more in this online tutorial.

%d = Day # (of the month e.g. 16, 17, 18…)
%a = abbreviated weekday (Mon, Tue, Wed, etc.)
%A = full weekday (Monday, Tuesday, etc.)
%m = # of month (e.g. 01, 02, 03, 04)
%b = abbreviated month (Jan, Feb, etc.)
%B = Full Month (January, February, etc.)
%y = 2-digit year (e.g. 89)
%Y = 4-digit year (e.g. 1989)
%H = hours (24-hr clock)
%M = minutes
%S = seconds
%z = offset from GMT
%Z = Time zone (character)

An example of formatting today’s date:

# today's date, with formatting
format(Sys.Date(), format="%d %B %Y")
## [1] "31 October 2020"

# easy way to get full date and time (no formatting)
date()
## [1] "Sat Oct 31 15:21:04 2020"

# formatted date, time, and time zone (using paste0() function)
paste0(format(Sys.Date(), format= "%A, %b %d '%y, %z  %Z, "), format(Sys.time(), format = "%H:%M:%S"))
## [1] "Saturday, Oct 31 '20, +0000  UTC, 15:21:04"
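One point worth keeping in mind is that format() returns character values, not dates, so it is usually best applied at the display stage only. A short sketch on a fixed, made-up date (weekday and month names depend on your computer's language settings, so only numeric codes are checked here):

```r
onset <- as.Date("2018-01-05")

# Numeric day/month/year with slashes
onset_dmy <- format(onset, "%d/%m/%Y")
print(onset_dmy)          # "05/01/2018"

# Weekday and month names follow the computer's locale
format(onset, "%A %d %b")

# The formatted result is text, no longer class Date
class(onset)              # "Date"
class(onset_dmy)          # "character"
```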

Calculating distance between dates

The difference between dates can be calculated by:

  1. Correctly formatting both date variables as class Date (see instructions above)
  2. Creating a new variable that is defined as one date variable subtracted from the other
  3. Converting the result to numeric class (default is class “difftime”). This ensures that subsequent mathematical calculations can be performed.
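These three steps might look as follows on a pair of small, made-up date vectors (the variable names are illustrative):

```r
# Step 1: both variables are class Date
onset           <- as.Date(c("2013-03-01", "2013-03-10"))
hospitalisation <- as.Date(c("2013-03-04", "2013-03-15"))

# Step 2: subtract one date from the other
delay <- hospitalisation - onset
class(delay)                                      # "difftime"

# Step 3: convert to numeric, stating the units explicitly
delay_days <- as.numeric(delay, units = "days")   # 3 and 5

mean(delay_days)                                  # 4
```

Stating units = "days" guards against R choosing a different unit when the inputs include times.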

Converting dates/time zones

TODO

Epidemiological weeks

The templates use the very flexible package aweek to set epidemiological weeks. You can read more about it on the RECON website.

Dates in Epicurves

See the section on epicurves.

Dates miscellaneous

  • Sys.Date() returns the current date of your computer
  • Sys.time() returns the current time of your computer
  • date() returns the current date and time.

Cleaning data

Overview

Text about cleaning data, approaches, etc. renaming replace missing with dealing with cases (all lower, etc) case_when() factors

Packages

pacman::p_load(tidyverse, janitor, epitrix)

Clean Variable Names

Variable names are used so often, it is best to have them be “clean” (no spaces, no unusual characters, etc.)

The function clean_names() from the package janitor is very useful.

https://cran.r-project.org/web/packages/janitor/vignettes/janitor.html#cleaning

linelist <- linelist_raw %>% 
  janitor::clean_names()
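To give a feel for what “clean” names look like, here is a rough base R approximation applied to some made-up column names. It is only a sketch of the general idea and is not how clean_names() is implemented; janitor handles many more cases.

```r
raw_names <- c("Case ID", "Date of Onset", "Age (years)", "outcome")

cleaned <- tolower(raw_names)                 # all lower case
cleaned <- gsub("[^a-z0-9]+", "_", cleaned)   # spaces and symbols become underscores
cleaned <- gsub("^_+|_+$", "", cleaned)       # drop leading/trailing underscores

cleaned
```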

Classes

See section on object classes

linelist <- linelist_raw %>% 
  mutate(age           = as.numeric(age),
         outcome       = as.character(outcome),
         date_of_onset = as.Date(date_of_onset, format = "%d/%m/%Y"),
         outcome       = factor(outcome, levels = c("Recover", "Death"))
         )
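When converting classes it is worth checking for values that silently become missing. The sketch below uses made-up values to show two common cases: a factor level that is misspelled in the data, and a non-numeric entry in a numeric column.

```r
# Values not listed in levels = become NA (note the lower-case "recover")
outcome_raw <- c("Recover", "Death", "recover", "Death")
outcome_fct <- factor(outcome_raw, levels = c("Recover", "Death"))
table(outcome_fct, useNA = "ifany")
n_lost_outcome <- sum(is.na(outcome_fct))

# Text that is not a number becomes NA (R would normally also give a warning)
age_raw <- c("34", "51", "unknown")
age_num <- suppressWarnings(as.numeric(age_raw))
n_lost_age <- sum(is.na(age_num))
```

Comparing the count of missing values before and after a conversion is a quick way to catch these problems.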

Creating new variables

base R method

linelist$new_var <- "new value"

Using dplyr mutate() transmute()

linelist <- linelist %>% 
  mutate(new_var_dup    = id,                  # new variable - replicate another variable
         new_var_static = 7,                   # new variable - all values the same
         new_var_static = new_var_static + 5,  # you can modify a variable multiple times
         new_var_calc   = (age / 12) + months, # new variable - calculation
         new_var_paste  = paste0(district, "(", province, ")")) # new variable - pasting together values

Groups by condition (case_when())

TODO tutorial on using case_when()

Numeric groups

For example, creating age groups cut()
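A minimal sketch of age groups with the base R function cut(), using made-up ages and break points chosen only for illustration:

```r
age <- c(0, 4, 5, 17, 18, 64, 65, 90)

age_group <- cut(age,
                 breaks = c(0, 5, 18, 65, Inf),              # group boundaries
                 labels = c("0-4", "5-17", "18-64", "65+"),  # one label per interval
                 right  = FALSE)                             # intervals include the lower bound, exclude the upper

table(age_group)
```

With right = FALSE an age of exactly 5 falls in “5-17”, which matches how age groups are usually labelled.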

case_when()

age_categories() (R4Epis package)

by percentile

Modifying existing values

Missing if… na_if()

Replace

Highest in hierarchy

Within a group, indicate/convert to the highest value in the group

Santa Clara County example - COVID contact tracing data - classification of multiple phone call records from same person into the highest category. (classify all as the highest of the group)
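One possible base R approach is sketched below. The data, the category names, and their ordering are all hypothetical and are not taken from the Santa Clara County data; the idea is to rank the categories, find the highest rank within each person, and convert that rank back to a label.

```r
# Made-up call records: several rows per person
calls <- data.frame(
  person_id = c("A", "A", "A", "B", "B"),
  status    = c("No answer", "Reached", "No answer", "No answer", "Left message"),
  stringsAsFactors = FALSE
)

# Hypothetical hierarchy, from lowest to highest
status_levels <- c("No answer", "Left message", "Reached")
calls$status  <- factor(calls$status, levels = status_levels, ordered = TRUE)

# Highest rank within each person, repeated on every row of that person
highest_rank <- ave(as.integer(calls$status), calls$person_id, FUN = max)

# Convert the rank back to its label
calls$highest_status <- factor(status_levels[highest_rank],
                               levels = status_levels, ordered = TRUE)

calls
```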

Missing Data

Overview

dealing with missing data percent missing over time etc.

Percent missing over time

Or change in percent of anything (X) over time, really.

lines <- linelist %>%
  mutate(date_of_onset = as.Date(date_of_onset, format = "%d/%m/%Y"),
         week = aweek::week2date(aweek::date2week(date_of_onset))) %>% 
  group_by(week) %>% 
  summarize(n_obs = n(),
            dt_hosp_missing = sum(date_of_hospitalisation == "" | is.na(date_of_hospitalisation)),
            dt_hosp_p_miss = dt_hosp_missing / n_obs,
            
            outcome_missing = sum(outcome == "" | is.na(outcome)),
            outcome_p_miss = outcome_missing / n_obs) %>%
  reshape2::melt(id.vars = c("week")) %>%
  filter(grepl("_p_", variable)) %>% 
  
  ggplot()+
    geom_line(aes(x = week, y = value, group = variable, color = variable), size = 1, stat = "identity")+
    labs(title = "Missingness in variables, as proportion of weekly observations",
         #subtitle = str_glue("As of {format(report_date, '%d %b')}"), 
         x = "Week",
         y = "Proportion missing") + 
    scale_color_discrete(name = "Variable", labels = c("Date of Hospitalization Missing", "Outcome Missing"))+
    scale_y_continuous(breaks = c(seq(0,1,0.1)))
    #theme_cowplot()#+
    #theme(legend.position = element_text("none"))

lines

Pivoting

Overview

(pivoting/melting etc.) Transforming datasets from wide-to-long, or long-to-wide…

Wide-to-long

Transforming a dataset from wide to long

Data

We start with data that is in a wide format, e.g. our linelist.

DT::datatable(linelist, rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T) )

pivot_longer()

#tidyr::pivot_longer(linelist, cols = -c(age, date_of_hospitalisation), names_to = "variable", values_to = "value", values_transform = list(value = as.character))
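A small self-contained sketch of pivot_longer() on a made-up wide dataset (not the linelist) may make the arguments easier to follow. The column names are hypothetical.

```r
library(tidyr)

# Wide format: one row per case, one column per measurement day
wide <- data.frame(
  case_id   = c(1, 2),
  temp_day1 = c(38.5, 37.2),
  temp_day2 = c(39.0, 37.0)
)

# Long format: one row per case per measurement
long <- pivot_longer(wide,
                     cols      = c(temp_day1, temp_day2),  # columns to stack
                     names_to  = "variable",               # new column holding the old column names
                     values_to = "value")                  # new column holding the values

long
```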

Result

Long-to-wide

tidyr pivot_wider()

Data

Pivot wider

Result
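As a placeholder illustration, the sketch below applies pivot_wider() to a small made-up long dataset; the column names are hypothetical.

```r
library(tidyr)

# Long format: one row per case per measurement
long <- data.frame(
  case_id  = c(1, 1, 2, 2),
  variable = c("temp_day1", "temp_day2", "temp_day1", "temp_day2"),
  value    = c(38.5, 39.0, 37.2, 37.0)
)

# Wide format: one row per case, one column per value of "variable"
wide <- pivot_wider(long,
                    names_from  = variable,   # values here become the new column names
                    values_from = value)      # values here fill the new columns

wide
```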

Grouping/aggregating data

Overview

Tidyverse - grouping by values

.drop=F in group_by() command

group_by()

aggregate()
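The sketch below, on made-up data, contrasts the two approaches named above. It assumes a factor column with one level (“East”) that has no cases, to show what .drop = FALSE changes.

```r
library(dplyr)

cases <- data.frame(
  district = factor(c("North", "North", "South"),
                    levels = c("North", "South", "East")),
  age      = c(34, 50, 8)
)

# dplyr: .drop = FALSE keeps factor levels that have zero rows
by_district <- cases %>%
  group_by(district, .drop = FALSE) %>%
  summarise(n_cases  = n(),
            mean_age = mean(age),
            .groups  = "drop")

# base R: aggregate() returns only the groups that are present in the data
by_district_base <- aggregate(age ~ district, data = cases, FUN = mean)

by_district
by_district_base
```

Keeping empty groups can matter in surveillance tables, where a district with zero cases should still appear as a row.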